A massively parallel corpus: the Bible in 100 languages
Internal identifier: 000083 (Main/Exploration); previous: 000082; next: 000084
Authors: Christos Christodouloupoulos [United States]; Mark Steedman [United Kingdom]
Source:
- Language Resources and Evaluation [1574-020X]; 2014.
Abstract
We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.
Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551210
DOI: 10.1007/s10579-014-9287-y
PubMed: 26321896
PubMed Central: 4551210
Links toward previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 000067
- to stream Pmc, to step Curation: 000066
- to stream Pmc, to step Checkpoint: 000076
- to stream Ncbi, to step Merge: 000197
- to stream Ncbi, to step Curation: 000197
- to stream Ncbi, to step Checkpoint: 000197
- to stream Main, to step Merge: 000083
- to stream Main, to step Curation: 000083
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A massively parallel corpus: the Bible in 100 languages</title>
<author><name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
<affiliation wicri:level="2"><nlm:aff id="Aff1">Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
<wicri:cityArea>Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
<affiliation wicri:level="4"><nlm:aff id="Aff2">School of Informatics, University of Edinburgh, Edinburgh, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>School of Informatics, University of Edinburgh, Edinburgh</wicri:regionArea>
<placeName><settlement type="city">Édimbourg</settlement>
<region type="country">Écosse</region>
</placeName>
<orgName type="university">Université d'Édimbourg</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26321896</idno>
<idno type="pmc">4551210</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551210</idno>
<idno type="RBID">PMC:4551210</idno>
<idno type="doi">10.1007/s10579-014-9287-y</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000067</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000067</idno>
<idno type="wicri:Area/Pmc/Curation">000066</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000066</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000076</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000076</idno>
<idno type="wicri:Area/Ncbi/Merge">000197</idno>
<idno type="wicri:Area/Ncbi/Curation">000197</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000197</idno>
<idno type="wicri:doubleKey">1574-020X:2014:Christodouloupoulos C:a:massively:parallel</idno>
<idno type="wicri:Area/Main/Merge">000083</idno>
<idno type="wicri:Area/Main/Curation">000083</idno>
<idno type="wicri:Area/Main/Exploration">000083</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">A massively parallel corpus: the Bible in 100 languages</title>
<author><name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
<affiliation wicri:level="2"><nlm:aff id="Aff1">Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
<wicri:cityArea>Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
<affiliation wicri:level="4"><nlm:aff id="Aff2">School of Informatics, University of Edinburgh, Edinburgh, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>School of Informatics, University of Edinburgh, Edinburgh</wicri:regionArea>
<placeName><settlement type="city">Édimbourg</settlement>
<region type="country">Écosse</region>
</placeName>
<orgName type="university">Université d'Édimbourg</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j">Language Resources and Evaluation</title>
<idno type="ISSN">1574-020X</idno>
<idno type="eISSN">1574-0218</idno>
<imprint><date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Kanungo, T" uniqKey="Kanungo T">T Kanungo</name>
</author>
<author><name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
<author><name sortKey="Mao, S" uniqKey="Mao S">S Mao</name>
</author>
<author><name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author><name sortKey="Zheng, Q" uniqKey="Zheng Q">Q Zheng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Marcus, M" uniqKey="Marcus M">M Marcus</name>
</author>
<author><name sortKey="Santorini, B" uniqKey="Santorini B">B Santorini</name>
</author>
<author><name sortKey="Marcinkiewicz, Ma" uniqKey="Marcinkiewicz M">MA Marcinkiewicz</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Och, Fj" uniqKey="Och F">FJ Och</name>
</author>
<author><name sortKey="Ney, H" uniqKey="Ney H">H Ney</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Potthast, M" uniqKey="Potthast M">M Potthast</name>
</author>
<author><name sortKey="Barr N Cede O, A" uniqKey="Barr N Cede O A">A Barrón-Cedeño</name>
</author>
<author><name sortKey="Stein, B" uniqKey="Stein B">B Stein</name>
</author>
<author><name sortKey="Rosso, P" uniqKey="Rosso P">P Rosso</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
<author><name sortKey="Olsen, M" uniqKey="Olsen M">M Olsen</name>
</author>
<author><name sortKey="Diab, M" uniqKey="Diab M">M Diab</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Wei, Cp" uniqKey="Wei C">CP Wei</name>
</author>
<author><name sortKey="Yang, Cc" uniqKey="Yang C">CC Yang</name>
</author>
<author><name sortKey="Lin, Cm" uniqKey="Lin C">CM Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<affiliations><list><country><li>Royaume-Uni</li>
<li>États-Unis</li>
</country>
<region><li>Illinois</li>
<li>Écosse</li>
</region>
<settlement><li>Édimbourg</li>
</settlement>
<orgName><li>Université d'Édimbourg</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Illinois"><name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
</region>
</country>
<country name="Royaume-Uni"><region name="Écosse"><name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
</region>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Linguistique/explor/TamazightV2/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000083 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000083 | SxmlIndent | more
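Once a record has been selected, it can also be read programmatically. The following is a minimal Python sketch, not part of Dilib. It assumes the record looks like the XML shown above, where the `wicri:` and `nlm:` prefixes are used without namespace declarations, so a strict XML parser would reject it as-is. The functions `parse_record` and `summarize`, the trimmed `RECORD` sample, and the placeholder namespace URIs are all hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a trimmed copy of the record shown above.
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">A massively parallel corpus: the Bible in 100 languages</title>
<author><name sortKey="Christodouloupoulos, Christos">Christos Christodouloupoulos</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country></affiliation>
</author>
<author><name sortKey="Steedman, Mark">Mark Steedman</name>
<affiliation wicri:level="4"><country xml:lang="fr">Royaume-Uni</country></affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="pmid">26321896</idno>
<idno type="doi">10.1007/s10579-014-9287-y</idno>
</publicationStmt></fileDesc></teiHeader></TEI></record>"""


def parse_record(xml_text):
    """Parse one record, binding the undeclared prefixes first."""
    # Assumption: the namespace URIs below are placeholders, not official ones.
    # They only exist so that the parser accepts the wicri: and nlm: prefixes.
    patched = xml_text.replace(
        "<record>",
        '<record xmlns:wicri="urn:example:wicri" xmlns:nlm="urn:example:nlm">',
        1,
    )
    return ET.fromstring(patched)


def summarize(root):
    """Collect the title, authors with country, DOI and PubMed id."""
    title_stmt = root.find(".//titleStmt")
    authors = [
        (author.findtext("name"), author.findtext("affiliation/country"))
        for author in title_stmt.findall("author")
    ]
    # Index identifiers by their type attribute; repeated types keep the last value.
    ids = {idno.get("type"): idno.text for idno in root.iter("idno")}
    return {
        "title": title_stmt.findtext("title"),
        "authors": authors,
        "doi": ids.get("doi"),
        "pmid": ids.get("pmid"),
    }


if __name__ == "__main__":
    print(summarize(parse_record(RECORD)))
```

To apply the same sketch to real data, one option is to replace `RECORD` with `sys.stdin.read()` and pipe the output of `HfdSelect` into the script, assuming the command prints a single record of this shape.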
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Linguistique |area= TamazightV2 |flux= Main |étape= Exploration |type= RBID |clé= PMC:4551210 |texte= A massively parallel corpus: the Bible in 100 languages }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i -Sk "pubmed:26321896" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd \
       | NlmPubMed2Wicri -a TamazightV2
This area was generated with Dilib version V0.6.33.